fix(prometheus_scrape source): scrape endpoints in parallel #17660
Conversation
This still needs some additional work:
Thanks for opening this!
This seems like a separate issue, no? Or is that behavior introduced by this PR?
I'm not sure I follow this one. The normal pattern would be to just execute the tasks async and let tokio manage scheduling. I think that's what is happening here?
This feels like a nice-to-have. I wouldn't consider it blocking this PR.
So before, if you have a slow Prometheus endpoint, the single request will wait until it is finished. This means that if you were scraping every 15 seconds, you may actually only get metrics every 30 seconds. Technically there is some data loss, but memory usage remains constant.

With this PR, the call to scrape occurs every 15 seconds regardless of how slow the endpoint is. The result is that if a scrape takes longer than 15 seconds, the request is placed in a queue. This queue can get longer and longer, memory usage will creep up, and presumably Vector will eventually be killed. I had to push things fairly hard to get a significant increase in memory: 10 endpoints scraped every second with a 20 second lag on each request was using about 5 GB after 30 minutes or so, so I'm not sure how much of a problem this would be in the real world.

Ideally we should implement the workarounds suggested by @wjordan. We just need to decide whether we should do that before merging this or after.
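A minimal sketch of the failure mode described above, not Vector's actual code: the endpoints, the 15 second interval, and the parsing stand-in are made up, and it assumes the `tokio` and `reqwest` crates. A fixed-interval tick fires a new request for every target whether or not the previous one has finished, so a target slower than the interval lets in-flight requests pile up without bound.

```rust
use std::time::Duration;

#[tokio::main]
async fn main() {
    // Hypothetical endpoints and interval; Vector reads these from the source config.
    let endpoints = ["http://10.0.0.1:9090/metrics", "http://10.0.0.2:9090/metrics"];
    let mut ticker = tokio::time::interval(Duration::from_secs(15));

    loop {
        ticker.tick().await;
        for endpoint in endpoints {
            // Each tick spawns a new request even if the previous one for this
            // endpoint is still running, so a slow target lets in-flight
            // requests (and their buffered responses) accumulate.
            tokio::spawn(async move {
                if let Ok(response) = reqwest::get(endpoint).await {
                    // Stand-in for parsing the Prometheus exposition format.
                    let _body = response.text().await;
                }
            });
        }
    }
}
```

Bounding each request with a timeout, as #18021 does, keeps that backlog from growing indefinitely.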
I'd say before merging; I doubt we'd find the time to prioritize that unless/until someone reports it as a bug, which isn't a great user experience.
@wjordan I couldn't push to your branch, but I added timeouts here in #18021. I'll make the timeouts configurable tomorrow 👍 One thing I noticed was that with lots of endpoints there was a bit of "set up" time where all the futures get created first and then they're run. For example, with around 20 endpoints, when the first "tick" started, it took about 7 seconds for the actual requests to go out. Not particularly problematic, but it was unexpected and not something I fully understand 🤔 Edit: fixed ☝️ in my PR
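Roughly, the timeout approach looks like the sketch below. This is not the code from #18021; `scrape_with_timeout`, the error handling, and the `limit` parameter are illustrative, and it assumes the `tokio` and `reqwest` crates.

```rust
use std::time::Duration;
use tokio::time::timeout;

/// Hypothetical helper: run one scrape, but give up after `limit`.
async fn scrape_with_timeout(endpoint: &str, limit: Duration) -> Option<String> {
    // Wrap the whole request + body read so slow bodies are also bounded.
    let request = async { reqwest::get(endpoint).await?.text().await };

    match timeout(limit, request).await {
        // Completed and succeeded within the limit.
        Ok(Ok(body)) => Some(body),
        // Completed but failed (connection refused, TLS error, ...).
        Ok(Err(err)) => {
            eprintln!("scrape of {endpoint} failed: {err}");
            None
        }
        // Deadline elapsed; the inner future is dropped.
        Err(_elapsed) => {
            eprintln!("scrape of {endpoint} timed out after {limit:?}");
            None
        }
    }
}
```

When the deadline elapses, `tokio::time::timeout` drops the inner future, so a stuck request releases its resources instead of sitting in the queue.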
@nullren has been making further progress on this in #18021, so I'm closing this PR in favor of that one and adding some follow-up to the previous discussion here:
My Rust/Tokio knowledge is limited, so I could be completely mistaken on this one. As I understand it, the current setup runs the scrapes async but not in separate tasks, so Tokio does the network I/O in parallel but can't run the request-processing load (HTTP parsing, event enrichment, etc.) across many threads, unless each client-request future is also wrapped in its own spawned task.
For reference, the real-world use cases both @nullren and I are working with involve thousands of endpoints; I had to kill Vector within seconds in my local testing.
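A minimal sketch of the task-spawning distinction described above, assuming it captures the point correctly; `scrape`, the parsing stand-in, and the endpoint list are illustrative, and it relies on the `tokio`, `futures`, and `reqwest` crates.

```rust
use futures::future::join_all;

// Stand-in for one scrape: an await for the network plus CPU-bound parsing.
async fn scrape(endpoint: &'static str) -> usize {
    let body = match reqwest::get(endpoint).await {
        Ok(response) => response.text().await.unwrap_or_default(),
        Err(_) => String::new(),
    };
    body.lines().count() // pretend this is the expensive parsing step
}

async fn scrape_all(endpoints: Vec<&'static str>) {
    // Joined: the scrapes run concurrently, but they all live on one task, so
    // the CPU-bound parsing of every response is polled from that single task
    // and cannot spread across the runtime's worker threads.
    let _sizes = join_all(endpoints.iter().copied().map(scrape)).await;

    // Spawned: each scrape is its own task, so a multi-threaded runtime can
    // schedule the parsing work on different worker threads.
    let handles: Vec<_> = endpoints
        .iter()
        .copied()
        .map(|endpoint| tokio::spawn(scrape(endpoint)))
        .collect();
    for handle in handles {
        let _ = handle.await;
    }
}
```

With thousands of endpoints, spawning one task per scrape lets the multi-threaded runtime spread the parsing load, at the cost of one task allocation per request.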
Closing in favor of #18021
…timeouts (#18021)

Fixes #14087
Fixes #14132
Fixes #17659

- [x] make target timeout configurable

This builds on what @wjordan did in #17660.

### what's changed

- Prometheus scrapes happen concurrently
- Requests to targets can time out
- The timeout can be configured (user-facing change)
- Small change in how the HTTP client was instantiated

Co-authored-by: Doug Smith <dsmith3197@users.noreply.github.com>
Co-authored-by: Stephen Wakely <stephen@lisp.space>
Fixes #17659.